note: I used the Lux theme for this document, which I got here
First, load the data and packages
library(tidyverse)
library(ggplot2)
library(aod)
library(plotly)
dat <- read.csv("../../Data/GradSchool_Admissions.csv")
Next, get an overview of the data
head(dat)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
Okay, we see we are working with school admittions, where admit is accepted (1) or not accepted (0), their gre test score, gpa, and school rank where 1 is the highest rank
Let’s change rank to a factor
dat$rank <- factor(dat$rank)
Create a model
mod <- glm(admit ~ gre + gpa + rank, data = dat, family = "binomial")
mod
##
## Call: glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = dat)
##
## Coefficients:
## (Intercept) gre gpa rank2 rank3 rank4
## -3.989979 0.002264 0.804038 -0.675443 -1.340204 -1.551464
##
## Degrees of Freedom: 399 Total (i.e. Null); 394 Residual
## Null Deviance: 500
## Residual Deviance: 458.5 AIC: 470.5
View summary of the model
summary(mod)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
What about an R-squared value?
summary(lm(mod))$r.squared
## [1] 0.1004006
Okay, pretty low… Our model might not be the best but don’t sweat the small stuff
Use the confint function to obtain confidence intervals (CIs) for the coefficient estimates
confint(mod)
## 2.5 % 97.5 %
## (Intercept) -6.2716202334 -1.792547080
## gre 0.0001375921 0.004435874
## gpa 0.1602959439 1.464142727
## rank2 -1.3008888002 -0.056745722
## rank3 -2.0276713127 -0.670372346
## rank4 -2.4000265384 -0.753542605
We can exponentiate the coefficients and interpret them as odds-ratios. We can use the same logic to get probability ratios and their confidence intervals, by exponentiating the confidence intervals from before
exp(coef(mod))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise
## odds ratios and 95% CI
exp(cbind(OR = coef(mod), confint(mod)))
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
We interpret this to mean that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23
Let’s use predicted probabilities to help us understand the model. In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions
First, calculate the predicted probability of admission at each value of rank, holding gre and gpa at their means, and view the new data frame
newdata1 <- with(dat, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
## view data frame
newdata1
## gre gpa rank
## 1 587.7 3.3899 1
## 2 587.7 3.3899 2
## 3 587.7 3.3899 3
## 4 587.7 3.3899 4
Great, now we have the data frame we want to use to calculate the predicted probabilities
Let’s add the probablility values to the data set
newdata1$rankP <- predict(mod, newdata = newdata1, type = "response")
newdata1
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
By reading the above chart, we understand that (while holding gre and gpa at their means) the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1)
We can do something very similar to create a table of predicted probabilities varying the value of gre and rank. To plot these, we will create 100 values of gre between 200 and 800, at each value of rank (i.e., 1, 2, 3, and 4)
newdata2 <- with(dat, data.frame(gre = rep(seq(from = 200, to = 800, length.out = 100),
4), gpa = mean(gpa), rank = factor(rep(1:4, each = 100))))
Then, we generate the predicted probabilities except we are also going to ask for standard errors (se = TRUE) so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.This gives us our final data set, with all the values we want
newdata3 <- cbind(newdata2, predict(mod, newdata = newdata2, type = "link",
se = TRUE))
newdata3 <- within(newdata3, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
head(newdata3)
## gre gpa rank fit se.fit residual.scale UL LL
## 1 200.0000 3.3899 1 -0.8114870 0.5147714 1 0.5492064 0.1393812
## 2 206.0606 3.3899 1 -0.7977632 0.5090986 1 0.5498513 0.1423880
## 3 212.1212 3.3899 1 -0.7840394 0.5034491 1 0.5505074 0.1454429
## 4 218.1818 3.3899 1 -0.7703156 0.4978239 1 0.5511750 0.1485460
## 5 224.2424 3.3899 1 -0.7565919 0.4922237 1 0.5518545 0.1516973
## 6 230.3030 3.3899 1 -0.7428681 0.4866494 1 0.5525464 0.1548966
## PredictedProb
## 1 0.3075737
## 2 0.3105042
## 3 0.3134499
## 4 0.3164108
## 5 0.3193867
## 6 0.3223773
Make a new plot with the new data set, showing the predicted probability that students were accepted given their gre and school ranking
p <- ggplot(newdata3, aes(x = gre, y = PredictedProb)) + geom_ribbon(aes(ymin = LL,
ymax = UL, fill = rank), alpha = 0.2) + geom_line(aes(colour = rank),
size = 1)
p
That looks’s a little boring, so let’s use an interactive plot using the plotly package
fig <- plot_ly( type = 'scatter', mode = 'markers')
fig <- fig %>%
add_trace(data = newdata3,
x = ~gre,
y = ~PredictedProb,
hoverinfo = 'image',
color = newdata3$rank,
labels = newdata3$rank,
showlegend = TRUE
)
fig
Whether we use the first or the second graph, we can see the relationships between the given varialbes. There appears to be a positive relationship between gre, rank and probability of being accepted. In other words, student’s who had a higher gre and rank had a higher change of acceptance than those who scored lower and came from schools of less prestige.